38 research outputs found
Spatio-Temporal FAST 3D Convolutions for Human Action Recognition
Effective processing of video input is essential for the recognition of
temporally varying events such as human actions. Motivated by the often
distinctive temporal characteristics of actions in either horizontal or
vertical direction, we introduce a novel convolution block for CNN
architectures with video input. Our proposed Fractioned Adjacent Spatial and
Temporal (FAST) 3D convolutions are a natural decomposition of a regular 3D
convolution. Each convolution block consists of three sequential convolution
operations: a 2D spatial convolution followed by spatio-temporal convolutions
in the horizontal and vertical direction, respectively. Additionally, we
introduce a FAST variant that treats horizontal and vertical motion in
parallel. Experiments on benchmark action recognition datasets UCF-101 and
HMDB-51 with ResNet architectures demonstrate consistently increased performance
of FAST 3D convolution blocks over traditional 3D convolutions. The lower
validation loss indicates better generalization, especially for deeper
networks. We also evaluate the performance of CNN architectures with similar
memory requirements, based either on Two-stream networks or with 3D convolution
blocks. DenseNet-121 with FAST 3D convolutions was shown to perform best,
giving further evidence of the merits of the decoupled spatio-temporal
convolutions.
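The decomposition described in this abstract can be made concrete by comparing kernel shapes. The sketch below is illustrative only. It assumes cubic k x k x k kernels, equal input and output channel counts, no bias terms, axes ordered (time, height, width), and one plausible reading of the three operations: a 1 x k x k spatial kernel, then k x 1 x k and k x k x 1 spatio-temporal kernels. None of these shapes is stated in the abstract, and the function names are hypothetical.

```python
from math import prod


def regular_3d_kernel(k):
    """Kernel shape of a single regular 3D convolution (time, height, width)."""
    return [(k, k, k)]


def fast_3d_kernels(k):
    """Assumed kernel shapes for the three sequential FAST operations."""
    return [
        (1, k, k),  # 2D spatial convolution: height x width
        (k, 1, k),  # horizontal spatio-temporal convolution: time x width
        (k, k, 1),  # vertical spatio-temporal convolution: time x height
    ]


def weight_count(kernels, channels):
    """Total weights, assuming `channels` inputs and outputs per convolution."""
    return sum(channels * channels * prod(shape) for shape in kernels)


for k in (3, 5):
    regular = weight_count(regular_3d_kernel(k), channels=64)
    fast = weight_count(fast_3d_kernels(k), channels=64)
    print(f"k={k}: regular={regular} weights, FAST sketch={fast} weights")
```

Under these assumptions the two weight counts coincide for k = 3 and differ only for larger kernels.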
Analyzing Human-Human Interactions: A Survey
Many videos depict people, and it is their interactions that inform us of
their activities, relation to one another and the cultural and social setting.
With advances in human action recognition, researchers have begun to address
the automated recognition of these human-human interactions from video. The
main challenges stem from dealing with the considerable variation in recording
setting, the appearance of the people depicted and the coordinated performance
of their interaction. This survey provides a summary of these challenges and
datasets to address these, followed by an in-depth discussion of relevant
vision-based recognition and detection methods. We focus on recent, promising
work based on deep learning and convolutional neural networks (CNNs). Finally,
we outline directions to overcome the limitations of the current
state-of-the-art to analyze and, eventually, understand social human actions.
Play It Back: Iterative Attention for Audio Recognition
A key function of auditory cognition is the association of characteristic
sounds with their corresponding semantics over time. Humans attempting to
discriminate between fine-grained audio categories often replay the same
discriminative sounds to increase their prediction confidence. We propose an
end-to-end attention-based architecture that through selective repetition
attends over the most discriminative sounds across the audio sequence. Our
model initially uses the full audio sequence and iteratively refines the
temporal segments replayed based on slot attention. At each playback, the
selected segments are replayed using a smaller hop length which represents
higher resolution features within these segments. We show that our method can
consistently achieve state-of-the-art performance across three
audio-classification benchmarks: AudioSet, VGG-Sound, and EPIC-KITCHENS-100.
Comment: Accepted at IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP) 202
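The link between a smaller hop length and higher-resolution features can be illustrated by counting analysis frames for a segment of fixed length. This is a minimal sketch, not the authors' pipeline: it assumes unpadded framing, where the frame count is 1 + (n_samples - window) // hop, and it assumes the hop is halved at every playback. The sample count, window size and starting hop are hypothetical values chosen for illustration.

```python
def num_frames(n_samples, window, hop):
    """Number of full analysis frames in a segment, without padding."""
    if n_samples < window:
        return 0
    return 1 + (n_samples - window) // hop


def playback_resolutions(n_samples, window, first_hop, playbacks):
    """(hop, frame count) per playback, halving the hop each time (assumption)."""
    hop = first_hop
    schedule = []
    for _ in range(playbacks):
        schedule.append((hop, num_frames(n_samples, window, hop)))
        hop = max(1, hop // 2)
    return schedule


# One second of hypothetical 16 kHz audio with a 400-sample window.
for hop, frames in playback_resolutions(16000, 400, 160, 3):
    print(f"hop={hop:4d} -> {frames} frames")
```

Each halving of the hop roughly doubles the number of frames covering the same segment, which is the sense in which a replayed segment is seen at higher temporal resolution.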
Learn to cycle: Time-consistent feature discovery for action recognition
Generalizing over temporal variations is a prerequisite for effective action
recognition in videos. Despite significant advances in deep neural networks, it
remains a challenge to focus on short-term discriminative motions in relation
to the overall performance of an action. We address this challenge by allowing
some flexibility in discovering relevant spatio-temporal features. We introduce
Squeeze and Recursion Temporal Gates (SRTG), an approach that favors inputs
with similar activations with potential temporal variations. We implement this
idea with a novel CNN block that uses an LSTM to encapsulate feature dynamics,
in conjunction with a temporal gate that is responsible for evaluating the
consistency of the discovered dynamics and the modeled features. We show
consistent improvement when using SRTG blocks, with only a minimal increase in
the number of GFLOPs. On Kinetics-700, we perform on par with current
state-of-the-art models, and outperform these on HACS, Moments in Time, UCF-101
and HMDB-51.
AdaPool: Exponential Adaptive Pooling for Information-Retaining Downsampling
Pooling layers are essential building blocks of Convolutional Neural Networks
(CNNs) that reduce computational overhead and increase the receptive fields of
subsequent convolutional operations. They aim to produce downsampled volumes
that closely resemble the input volume while, ideally, also being
computationally and memory efficient. It is a challenge to meet both
requirements jointly. To this end, we propose an adaptive and exponentially
weighted pooling method named adaPool. Our proposed method uses a parameterized
fusion of two sets of pooling kernels that are based on the exponent of the
Dice-Sorensen coefficient and the exponential maximum, respectively. A key
property of adaPool is its bidirectional nature. In contrast to common pooling
methods, weights can be used to upsample a downsampled activation map. We term
this method adaUnPool. We demonstrate how adaPool improves the preservation of
detail through a range of tasks including image and video classification and
object detection. We then evaluate adaUnPool on image and video frame
super-resolution and frame interpolation tasks. For benchmarking, we introduce
Inter4K, a novel high-quality, high frame-rate video dataset. Our combined
experiments demonstrate that adaPool systematically achieves better results
across tasks and backbone architectures, while introducing only minor additional
computational and memory overhead.
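The fusion of two exponentially weighted kernels can be sketched for a single pooling region. This is an assumption-laden illustration rather than the published implementation: it takes the Dice-Sorensen coefficient between each activation vector and the region mean, uses softmax weights per channel for the exponential maximum, and fixes the fusion weight `beta` as a plain scalar, whereas the abstract describes it as parameterized. All function names are hypothetical.

```python
import math


def dice_sorensen(u, v):
    """Dice-Sorensen coefficient between two real-valued vectors."""
    numerator = 2.0 * abs(sum(a * b for a, b in zip(u, v)))
    denominator = sum(a * a for a in u) + sum(b * b for b in v)
    return numerator / denominator if denominator else 0.0


def ada_pool_region(region, beta):
    """Pool one region given as a list of per-position channel vectors."""
    n, channels = len(region), len(region[0])
    mean = [sum(v[c] for v in region) / n for c in range(channels)]

    # Kernel 1: one weight per position, exp of similarity to the region mean.
    w = [math.exp(dice_sorensen(v, mean)) for v in region]
    w_sum = sum(w)
    edsc = [sum(w[i] * region[i][c] for i in range(n)) / w_sum
            for c in range(channels)]

    # Kernel 2: per-channel softmax weights (a smooth maximum).
    em = []
    for c in range(channels):
        column = [v[c] for v in region]
        peak = max(column)  # subtracted for numerical stability
        e = [math.exp(x - peak) for x in column]
        em.append(sum(ei * x for ei, x in zip(e, column)) / sum(e))

    # Fusion: convex combination of the two pooled vectors.
    return [beta * a + (1.0 - beta) * b for a, b in zip(edsc, em)]


# A hypothetical 2x2 region with a single channel, flattened to four positions.
print(ada_pool_region([[0.0], [1.0], [2.0], [3.0]], beta=0.5))
```

Because both kernels use normalized, non-negative weights, the pooled value for `beta` in [0, 1] stays between the smallest and largest activation in the region.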
Leaping Into Memories: Space-Time Deep Feature Synthesis
The success of deep learning models has led to their adaptation and adoption
by prominent video understanding methods. The majority of these approaches
encode features in a joint space-time modality for which the inner workings and
learned representations are difficult to visually interpret. We propose LEArned
Preconscious Synthesis (LEAPS), an architecture-agnostic method for
synthesizing videos from the internal spatiotemporal representations of models.
Using a stimulus video and a target class, we prime a fixed space-time model
and iteratively optimize a video initialized with random noise. We incorporate
additional regularizers to improve the feature diversity of the synthesized
videos as well as the cross-frame temporal coherence of motions. We
quantitatively and qualitatively evaluate the applicability of LEAPS by
inverting a range of spatiotemporal convolutional and attention-based
architectures trained on Kinetics-400, which to the best of our knowledge has
not been previously accomplished.
Multi-Temporal Convolutions for Human Action Recognition in Videos
Effective extraction of temporal patterns is crucial for the recognition of
temporally varying actions in video. We argue that the fixed-sized
spatio-temporal convolution kernels used in convolutional neural networks
(CNNs) can be improved to extract informative motions that are executed at
different time scales. To address this challenge, we present a novel
spatio-temporal convolution block that is capable of extracting spatio-temporal
patterns at multiple temporal resolutions. Our proposed multi-temporal
convolution (MTConv) blocks utilize two branches that focus on brief and
prolonged spatio-temporal patterns, respectively. The extracted time-varying
features are aligned in a third branch, with respect to global motion patterns
through recurrent cells. The proposed blocks are lightweight and can be
integrated into any 3D-CNN architecture. This introduces a substantial
reduction in computational costs. Extensive experiments on Kinetics, Moments in
Time and HACS action recognition benchmark datasets demonstrate competitive
performance of MTConvs compared to the state-of-the-art with a significantly
lower computational footprint.
Learning Class Regularized Features for Action Recognition
Training Deep Convolutional Neural Networks (CNNs) is based on the notion of
using multiple kernels and non-linearities in their subsequent activations to
extract useful features. The kernels are used as general feature extractors
without specific correspondence to the target class. As a result, the extracted
features do not correspond to specific classes. Subtle differences between
similar classes are modeled in the same way as large differences between
dissimilar classes. To overcome the class-agnostic use of kernels in CNNs, we
introduce a novel method named Class Regularization that performs class-based
regularization of layer activations. We demonstrate that this not only improves
feature search during training, but also allows an explicit assignment of
features per class during each stage of the feature extraction process. We show
that using Class Regularization blocks in state-of-the-art CNN architectures
for action recognition leads to systematic gains of 1.8%, 1.2% and
1.4% on the Kinetics, UCF-101 and HMDB-51 datasets, respectively.
KARST FEATURES AND RELATED SOCIAL PROCESSES IN THE REGION OF THE VIKOS GORGE AND TYMPHI MOUNTAIN (NORTHERN PINDOS NATIONAL PARK, GREECE)
Due to unfavourable natural conditions (poor soils, lack of water, special relief conditions), karst terrains have always been relatively sparsely populated, and they have been seriously affected by recent depopulation processes. However, the creation of national parks on karst terrains and the recent increase of (geo)tourism may influence and even reverse these population trends. Our study examines the validity of this statement in the context of Vikos Gorge and Tymphi Mountain (NW Greece). Geological and geomorphological values are presented first, including Vikos Gorge, the glaciokarst landscape of Tymphi and the particular spherical rock concretions. Digital terrain analysis is used to obtain scientifically based, reliable morphometric parameters for Vikos Gorge: the maximum gorge depth is 1144 m, the maximum width is 2420 m, and the maximum depth/width ratio is 0.76. Thereafter, rural depopulation trends are examined, and it is found that this region (Zagori) is seriously affected by depopulation. There are differences among settlements, and a relative stabilization of population is perceptible in only a few settlements around Vikos Gorge, which are linked to tourism. As for nature protection, while at the beginning conflicts were perceptible between management and local people, now new conflicts are emerging between growing tourism and nature protection goals.
Key words: gorge morphometry, glaciokarst, spherical concretions, rural depopulation, geotourism, national park.
Karst features and karst-related social changes in the area of the Vikos Gorge and the Tymphi mountains (Northern Pindos National Park, Greece)
Owing to unfavourable natural conditions, such as less fertile soil, lack of water and distinctive relief, karst terrain has always been relatively sparsely populated, and in recent times it has also been subject to out-migration. Recently there has also been a growing amount of geotourism and an associated establishment of geoparks, which can even reverse the depopulation trend. The present study evaluates these processes using the example of the Vikos valley and the Tymphi mountains (NW Greece). First, the geomorphological and geological characteristics of the area are presented, highlighting the features of the Vikos Gorge, the glaciokarst of the Tymphi mountains and the round rock concretions characteristic of this area. A digital elevation model was used for the morphometric analysis of the Vikos Gorge. This showed that its maximum depth is 1144 m, its maximum width is 2420 m, and its maximum depth-to-width ratio is 0.76. Next, the characteristics of rural out-migration are analysed, to which the Zagori area is most exposed. Demographic patterns show differences among individual settlements, with the population stable only in some settlements near the Vikos Gorge, which is attractive to tourists. The establishment of the park also brought conflicts. Initially, conflicts of interest arose between the local population and the park managers, whereas now conflict arises from the simultaneous growth of tourism and the desire to protect nature.
Key words: gorge morphometry, glaciokarst, round concretions, rural depopulation, geotourism, national park.
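The three reported maxima need not come from the same cross-section: 1144 m divided by 2420 m is about 0.47, not 0.76, so the ratio is presumably maximised over individual cross-sections. The sketch below illustrates that distinction. The cross-section values are hypothetical toy numbers, not measurements from the study, and the function name is an assumption.

```python
def gorge_morphometry(cross_sections):
    """Return (max depth, max width, max depth/width ratio).

    `cross_sections` is a list of (depth_m, width_m) pairs, one per profile
    taken across the gorge. Each maximum is evaluated independently.
    """
    max_depth = max(depth for depth, _ in cross_sections)
    max_width = max(width for _, width in cross_sections)
    max_ratio = max(depth / width for depth, width in cross_sections)
    return max_depth, max_width, max_ratio


# Hypothetical profiles: a narrow, steep section and two wide, deep ones.
profiles = [(400.0, 500.0), (900.0, 2000.0), (1000.0, 2400.0)]
depth, width, ratio = gorge_morphometry(profiles)
print(f"max depth {depth} m, max width {width} m, max ratio {ratio:.2f}")
print(f"ratio of the two maxima: {depth / width:.2f}")
```

In this toy example the largest ratio comes from the narrow section, while the greatest depth and width come from a different profile.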
The efficacy of Equine Assisted Therapy intervention in gross motor function, performance, and spasticity in children with Cerebral Palsy
Purpose: To evaluate the efficacy of Equine Assisted Therapy in children with Cerebral Palsy, in terms of gross motor function, performance, and spasticity, as well as whether this improvement can be maintained for 2 months after the end of the intervention.
Methods: Children with Cerebral Palsy participated in this prospective cohort study. The study lasted for 28 weeks, of which the equine assisted therapy lasted 12 weeks, taking place once a week for 30 min. A repeated-measures within-subject design was used to evaluate each child’s physical performance and mental capacity, consisting of six measurements: Gross Motor Function Measure-88 (GMFM-88), Gross Motor Performance Measure (GMPM), Gross Motor Function Classification System (GMFCS), Modified Ashworth Scale (MAS) and Wechsler Intelligence Scale for Children (WISC III).
Results: Statistically significant improvements were achieved for 31 children in Gross Motor Function Measure and all its subcategories (p < 0.005), and also in total Gross Motor Performance Measure and all its subcategories (p < 0.005). These Gross Motor Function Measure results remained consistent for 2 months after the last session of the intervention. Regarding spasticity, although an improving trend was seen, it was not statistically significant.
Conclusion and implications: Equine Assisted Therapy improves motor ability (qualitatively and quantitatively) in children with Cerebral Palsy, with clinical significance in gross motor function.